This report explores a dataset of 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
We dropped the column X, which appears to be the row number. quality is an ordered categorical variable with scores between 0 and 10. Interestingly, the wine experts never gave the extreme scores 0 (very bad), 1, 2, or 10 (very excellent): the actual range is 3 to 9, with a median of 6. The remaining variables are continuous, which makes sense since they represent the amount of the corresponding substance in the wine, based on physicochemical tests.
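A minimal sketch of the cleanup described above, in base R. The data frame is a toy stand-in: the alcohol values are copied from the `str()` output, while the quality scores are made up for illustration.

```r
# Toy stand-in for the wine data (quality scores are illustrative, not real).
wines <- data.frame(
  X = 1:5,
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 5L, 7L, 6L, 3L)
)

# Drop the row-number column.
wines$X <- NULL

# Treat quality as an ordered factor on the full 0-10 scale,
# so the unused scores (0, 1, 2, 10) still appear as levels.
wines$quality <- factor(wines$quality, levels = 0:10, ordered = TRUE)

# Count wines per score; unused scores show up with a count of 0.
quality_counts <- table(wines$quality)
print(quality_counts)
```

Declaring all eleven levels up front is a design choice: it keeps the empty scores visible in tables and plots instead of silently dropping them.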
From this histogram of quality counts, we can see that the distribution is roughly normal, with the mean (solid line, 5.88) and the median (dashed line, 6) almost equal.
The most common quality is 6 (2,198 wines), followed by 5 (1,457 wines).
##
## FALSE TRUE
## 3838 1060
##
## midiocre premium
## 3838 1060
I remember my wine teacher always talking about the Pareto principle (80/20 rule) in the wine industry (yes, I had a wine teacher). Wines of quality 7, 8, or 9 make up 21.6% of all the white wines rated (1,060 of 4,898). Therefore, we will consider a quality of 7, 8, or 9 as premium.
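The premium share can be checked directly from the quality counts printed earlier. This is a small sketch; the variable names are my own.

```r
# Quality counts copied from the frequency table in this report.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

scores <- as.integer(names(quality_counts))

# Wines rated 7, 8, or 9 are treated as premium.
n_premium <- sum(quality_counts[scores >= 7])
n_total <- sum(quality_counts)

# Share of premium wines among all rated wines, in percent.
premium_share <- n_premium / n_total
print(round(100 * premium_share, 1))
```

With these counts, 1,060 of the 4,898 wines are premium, which is close to the 20% side of the 80/20 rule.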
##
## FALSE TRUE
## 4613 285
All wines contain sulphur dioxide in various forms, collectively known as sulphites. Even in completely unsulphured wine, it is present at concentrations of up to 10 mg/L. Commercially made wines contain ten to twenty times that amount. (Source: morethanorganic)
Reasons why SO2 is not desirable in wine:
According to EU law, the maximum permitted level of SO2 in white/rosé wine is 210 mg/l. As the first histogram shows, 285 wines exceed this limit. All three histograms have a right-skewed distribution. This might be due to the legal restriction on sulphites: most vineyards would obey the rules and avoid exceeding the limit.
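Counting the wines above the limit comes down to a logical comparison. The sketch below uses a toy vector: the first ten values are from the `str()` output, 440 is the maximum from the summary, and 230 is a made-up value for illustration.

```r
# Total sulfur dioxide in mg/L (the last two entries are added for illustration).
total_so2 <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129, 230, 440)

# EU limit for white/rose wine, as quoted in the text.
eu_limit <- 210

# TRUE for every wine that exceeds the limit.
over_limit <- total_so2 > eu_limit

# Counts of wines within and above the limit, and the share above it.
print(table(over_limit))
print(mean(over_limit))
```

On the full dataset the same comparison on the `total.sulfur.dioxide` column should reproduce the FALSE/TRUE table shown above.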
## [1] 3.188267
## [1] 3.18
In this set of histograms, we explore the acidity of the wines. The first three variables are the amounts of the corresponding acids found in the wines. The fourth variable, pH, indicates the acidity level, where 7 is neutral and the smaller the value, the more acidic the liquid. We observe a right-skewed distribution for the first three and a roughly normal distribution for pH, with median and mean at about 3.18 (acidic). It makes sense that the pH histogram is not right-skewed like the three above: the high-acid outliers in the acidity histograms would have lower pH values, so they fall in the left tail of the pH histogram.
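Skewness can also be put into a number instead of being read off a histogram. The helper below is a hypothetical, moment-based sketch (it is not part of the original analysis), applied to two small illustrative vectors rather than the real columns.

```r
# Moment-based sample skewness: about 0 for symmetric data,
# positive for a long right tail, negative for a long left tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative values only: a symmetric set and a right-skewed set.
symmetric <- c(3.0, 3.1, 3.2, 3.3, 3.4)
right_skewed <- c(1.5, 1.6, 1.7, 6.9, 20.7)

print(skewness(symmetric))
print(skewness(right_skewed))
```

Applied to the real data, a value near zero for pH and clearly positive values for the three acid variables would back up what the histograms suggest.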
Some people believe that the sweeter a wine is, the more alcohol it should contain. We cannot tell this from the histograms alone; we will look into it in the bivariate plots section. Here we can see that both residual sugar and salt (chlorides) have strongly right-skewed distributions. The amount of salt is tiny for all white wines, with a maximum of 0.346 g/L. The histogram of alcohol is slightly right-skewed, with a peak at around 9 to 9.5%, and is fairly uniform away from the peak. Most wines have an alcohol level of 8.5 to 12%.
The white wine quality dataset consists of 4898 observations and 12 variables. Each observation is a white variant of the Portuguese “Vinho Verde” wine. Among the 12 variables, there are 11 numeric input variables, which represent the amount of the corresponding substance in the wine based on physicochemical tests. The output variable, quality, is based on sensory data (the median of at least 3 evaluations made by wine experts). It is an ordered categorical variable ranging from 0 (very bad) to 10 (very excellent).
The main feature of interest is quality. I am curious to know how the amounts of the other factors affect the ratings from the wine experts.
From reading the description of the dataset here, I think volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide, and density may support my investigation, because they seem to affect the smell, taste, and color. density may contribute to the effect of “wine curtains”, which is also an essential part of wine tasting.
Yes, I created a new categorical variable, type, to indicate whether a wine is premium or mediocre: premium wines are the ones rated 7 or above, and the rest are mediocre.
I dropped the column X, which appears to be the row number. I also changed quality to an ordered categorical variable with scores between 0 and 10. Many variables have a right-skewed distribution with outliers at the far end of the tail. However, quality is roughly normally distributed, with no extreme values like 0, 1, 2, or 10. I haven’t removed the “outliers” from the dataset because at this point I am not sure whether their extremeness contributes to the feature of interest.
To get a broad overview of which variables might be interesting, a scatterplot matrix seems like a good idea.
An interesting observation is that outliers occur more often in the middle-range qualities (5, 6, 7) than at the extremes. Very few outliers can be observed for 9-quality or 3-quality wines.
If you look at the boxplot at quality 9 for each factor, notice that the “box” is generally smaller than for the other qualities (especially for density and sulfur.dioxide). This suggests that a specific set of characteristics is needed for a Portuguese “Vinho Verde” white wine to be rated “very excellent”, although with only 5 wines rated 9, the narrow boxes may also reflect the small sample. At this point, I’m impressed by the wine experts who rated these wines. Just by blind tasting, they can detect the excellent wines with exactly the right amount of each substance.
From the set of boxplots, we can observe that alcohol seems to be appreciated. With higher alcohol level, the median rating of quality is generally higher.
pH, fixed.acid, and citric acid show a slight positive correlation as well.
On the other hand, sulfur.dioxide, sugar, and density are not appreciated: they are negatively correlated with quality.
We can observe a strong correlation (0.839) between density and residual.sugar, which is what I suspected before.
It’s only natural to see that free.sulfur.dioxide and total.sulfur.dioxide have a correlation of 0.61.
Also as suspected before, sugar and density have a strong correlation of 0.839. Most other factors contribute somewhat to density, with correlations ranging from 0.15 to 0.839, except for volatile acidity (corr: 0.0271).
Surprisingly, alcohol and residual.sugar have a negative correlation of -0.427. alcohol and density also have a strong negative correlation of -0.78, which makes sense since density and residual.sugar are highly positively correlated.
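Pairwise correlations like the ones quoted above can be computed with base R's `cor()`. The sketch below uses only the first five rows shown in the `str()` output, so its values will differ from the full-data correlations reported here; only the signs are expected to agree.

```r
# First five rows of three columns, copied from the str() output.
sample_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation matrix of all numeric columns.
cor_matrix <- cor(sample_rows, method = "pearson")
print(round(cor_matrix, 2))

# Individual coefficients can be looked up by column name.
sugar_density <- cor_matrix["density", "residual.sugar"]
alcohol_density <- cor_matrix["density", "alcohol"]
```

Even in this tiny sample, density rises with residual sugar and falls with alcohol, which matches the direction of the relationships described in the text.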
From the boxplots on the quality column, we suspect that alcohol, total.sulfur.dioxide, and density have some effects on the ratings of wine quality by the wine experts.
The strongest relationship I found is between residual.sugar and density, with a correlation of 0.839. density and alcohol also have a strong negative correlation of -0.78.
To make this set of plots, outliers (residual.sugar > 30) are removed from the dataset.
We can see that the strong correlation between density and sugar holds at every quality level.
In the second plot, we can see that at the same level of sugar, premium wines are less dense than mediocre wines. Mediocre wines also have a wider range of residual.sugar levels (the outliers we didn’t show are also mediocre wines).
total.sulfur.dioxide and density are not as strongly correlated as sugar and density, but we can observe the same trend: the line of fit for premium wines is lower than the one for mediocre wines.
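The "lower line of fit for premium wines" can be expressed as a linear model with a shared slope and a per-type offset. This is a sketch on made-up data, not the fit behind the plots: the values are illustrative, and `type` here is a hypothetical factor mirroring the premium/mediocre split.

```r
# Illustrative data only; the 65.8 row plays the role of a sugar outlier.
wines <- data.frame(
  residual.sugar = c(1.6, 6.9, 8.5, 20.7, 65.8, 1.5, 7.0, 12.0, 18.0),
  density = c(0.9940, 0.9950, 0.9960, 1.0010, 1.0390,
              0.9912, 0.9935, 0.9958, 0.9979),
  type = factor(c(rep("mediocre", 5), rep("premium", 4)))
)

# Remove the sugar outliers before fitting, as done for the plots.
trimmed <- subset(wines, residual.sugar <= 30)

# Common slope for sugar plus an intercept shift for premium wines.
fit <- lm(density ~ residual.sugar + type, data = trimmed)
coefs <- coef(fit)
print(coefs)
```

A negative `typepremium` coefficient means that, at the same sugar level, premium wines are estimated to be less dense. Adding an interaction term (`residual.sugar * type`) would let the slopes differ as well.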
sugar and density seem to strengthen each other when looking at quality. At the same sugar level, premium wines tend to be less dense than mediocre wines. Wines with extremely high sugar levels have a lower chance of being rated as excellent.
Wines at quality levels 5, 6, and 7 account for most of the extreme values in features like residual.sugar and sulfur.dioxide. This is surprising, as they are not the lowest-rated wines (levels 3 and 4) but OK wines.
This is a histogram of the quality counts of the wines. The dashed line is the median and the solid line is the mean. We can see that the distribution is roughly normal, with the median at 6 and the mean just below it (5.88).
This is a correlation plot of all the numeric variables. We can see a strong positive correlation between density and residual.sugar, and a strong negative correlation between density and alcohol. alcohol tends to be negatively correlated with the rest of the variables. There is a self-explanatory positive correlation between free.sulfur.dioxide and total.sulfur.dioxide. total.sulfur.dioxide and density also show an interesting correlation.
From this plot, we can see that at the same level of sugar, premium wines are less dense than mediocre wines. Mediocre wines also have a wider range of residual.sugar levels (the outliers we didn’t show here are also mediocre wines).
At the beginning, it was hard to understand what each numeric variable means and how it could affect the quality of wine. After doing some research and reading the documentation of the dataset more carefully, it became clearer how I could explore this dataset. Another struggle is that the differences between the variables are really subtle: you can see from the scatterplots that the points all cluster together, so it’s hard to see anything when you just map quality to color in the same scatterplot. Maybe some transformation of the data could be used in the future to make it possible to visually separate the clusters.
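As one possible transformation (an assumption on my part, not something tried in this report), a log scale spreads out right-skewed variables such as residual.sugar. The sketch below uses only the five-number summary printed earlier and checks where the median sits within the range before and after a `log10` transform.

```r
# Five-number summary of residual.sugar, copied from the summary() output.
sugar_summary <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Position of the median within the range: 0 = at the minimum, 1 = at the maximum.
relative_position <- function(s) {
  unname((s["median"] - s["min"]) / (s["max"] - s["min"]))
}

raw_position <- relative_position(sugar_summary)
log_position <- relative_position(log10(sugar_summary))

print(round(c(raw = raw_position, log10 = log_position), 2))
```

On the raw scale the median sits at roughly 7% of the range, so most points are squeezed against the left edge; on the log scale it moves to roughly 46%, which should make overlapping clusters easier to see.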